Big Data Econometrics

2nd homework assignment

mansok (2017042)

2021-05-30

I10.9

I10.9 Consider the USArrests data. We will now perform hierarchical clustering on the states.
(a) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
(b) Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?
(c) Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
(d) What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.

set.seed(1995)
dat1 <- USArrests

(a)

fit.hclust <- hclust(dist(dat1), method = "complete")
plot(fit.hclust, main = "States clusters")

(b)

fit.hclust.c <- cutree(fit.hclust, 3)
fit.hclust.c
       Alabama         Alaska        Arizona       Arkansas     California 
             1              1              1              2              1 
      Colorado    Connecticut       Delaware        Florida        Georgia 
             2              3              1              1              2 
        Hawaii          Idaho       Illinois        Indiana           Iowa 
             3              3              1              3              3 
        Kansas       Kentucky      Louisiana          Maine       Maryland 
             3              3              1              3              1 
 Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
             2              1              3              1              2 
       Montana       Nebraska         Nevada  New Hampshire     New Jersey 
             3              3              1              3              2 
    New Mexico       New York North Carolina   North Dakota           Ohio 
             1              1              1              3              3 
      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
             2              2              3              2              1 
  South Dakota      Tennessee          Texas           Utah        Vermont 
             3              2              2              3              3 
      Virginia     Washington  West Virginia      Wisconsin        Wyoming 
             2              2              3              3              2 

fit.hclust.c %>% table() %>% t() %>% # xtable() %>%
kable(caption = "Number of states in each cluster") %>% kable_styling(position = "left", 
    bootstrap_options = "hover")
Number of states in each cluster
Cluster     1    2    3
N          16   14   20
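Part (b) also asks which states belong to which cluster; the flat vector above is hard to scan. A base-R sketch that reruns the same clustering and lists the member states per cluster:

```r
# Rerun the clustering and list the member states of each of the 3 clusters
hc <- hclust(dist(USArrests), method = "complete")
ct <- cutree(hc, k = 3)
split(names(ct), ct)                    # named list: cluster -> state names
sapply(split(names(ct), ct), length)    # cluster sizes: 16, 14, 20
```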

(c)

dat2 <- scale(dat1)
fit.hclust.s <- hclust(dist(dat2), method = "complete")
plot(fit.hclust.s)

(d)

fit.hclust.s.c <- cutree(fit.hclust.s, 3)
fit.hclust.s.c
       Alabama         Alaska        Arizona       Arkansas     California 
             1              1              2              3              2 
      Colorado    Connecticut       Delaware        Florida        Georgia 
             2              3              3              2              1 
        Hawaii          Idaho       Illinois        Indiana           Iowa 
             3              3              2              3              3 
        Kansas       Kentucky      Louisiana          Maine       Maryland 
             3              3              1              3              2 
 Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
             3              2              3              1              3 
       Montana       Nebraska         Nevada  New Hampshire     New Jersey 
             3              3              2              3              3 
    New Mexico       New York North Carolina   North Dakota           Ohio 
             2              2              1              3              3 
      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
             3              3              3              3              1 
  South Dakota      Tennessee          Texas           Utah        Vermont 
             3              1              2              3              3 
      Virginia     Washington  West Virginia      Wisconsin        Wyoming 
             3              3              3              3              3 

fit.hclust.s.c %>% table() %>% t() %>% # xtable() %>%
kable(caption = "Number of states in each cluster (scaled data)") %>% kable_styling(position = "left", 
    bootstrap_options = "hover")
Number of states in each cluster (scaled data)
Cluster     1    2    3
N           8   11   31
dd <- table(fit.hclust.c, fit.hclust.s.c)
rownames(dd) <- c("Non-scaled 1", "Non-scaled 2", "Non-scaled 3")
colnames(dd) <- c("Scaled 1", "Scaled 2", "Scaled 3")
dd %>% # xtable() %>%
kable(caption = "Model comparison on non-scaled vs scaled data") %>% kable_styling(position = "left", 
    bootstrap_options = "hover") %>% column_spec(1, bold = TRUE)
Model comparison on non-scaled vs scaled data
Scaled 1 Scaled 2 Scaled 3
Non-scaled 1 6 9 1
Non-scaled 2 2 2 10
Non-scaled 3 0 0 20
  • Scaling changes the maximum height of the dendrogram and, as the comparison table shows, also moves a substantial number of states between clusters, even though the overall tree shapes look similar.
    - The variables should be scaled before the inter-observation dissimilarities are computed, because they are measured in different units: Murder, Assault and Rape are arrest counts per 100,000 residents, while UrbanPop is a percentage. Without scaling, Assault dominates the Euclidean distances simply because it has by far the largest variance.
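The units argument can be made concrete: the per-variable standard deviations of the built-in USArrests data (base R) show why Assault would dominate unscaled Euclidean distances:

```r
# Standard deviations of the four variables; Assault's is an order of
# magnitude larger than the others, so it dominates unscaled distances
round(apply(USArrests, 2, sd), 2)
#   Murder  Assault UrbanPop     Rape
#     4.36    83.34    14.47     9.37
```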

I4.13

I4.13 Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.

General information

dat <- Boston
setDT(dat)
dat[, `:=`(crim.ind, fifelse(crim < median(crim), 0, 1))][, `:=`(crim, NULL)]
sapply(dat, summary) %>% t() %>% kable(caption = "General summary of Boston dataset variables") %>% 
    kable_styling(position = "left", bootstrap_options = c("striped", "hover")) %>% 
    column_spec(1, bold = T)
General summary of Boston dataset variables
Min. 1st Qu. Median Mean 3rd Qu. Max.
zn 0.0000 0.000000 0.00000 11.3636364 12.500000 100.0000
indus 0.4600 5.190000 9.69000 11.1367787 18.100000 27.7400
chas 0.0000 0.000000 0.00000 0.0691700 0.000000 1.0000
nox 0.3850 0.449000 0.53800 0.5546951 0.624000 0.8710
rm 3.5610 5.885500 6.20850 6.2846344 6.623500 8.7800
age 2.9000 45.025000 77.50000 68.5749012 94.075000 100.0000
dis 1.1296 2.100175 3.20745 3.7950427 5.188425 12.1265
rad 1.0000 4.000000 5.00000 9.5494071 24.000000 24.0000
tax 187.0000 279.000000 330.00000 408.2371542 666.000000 711.0000
ptratio 12.6000 17.400000 19.05000 18.4555336 20.200000 22.0000
black 0.3200 375.377500 391.44000 356.6740316 396.225000 396.9000
lstat 1.7300 6.950000 11.36000 12.6530632 16.955000 37.9700
medv 5.0000 17.025000 21.20000 22.5328063 25.000000 50.0000
crim.ind 0.0000 0.000000 0.50000 0.5000000 1.000000 1.0000
dat[, .N, crim.ind] %>% kable(caption = "Crime rate comparison to median") %>% kable_styling(position = "left", 
    bootstrap_options = "hover")
Crime rate comparison to median
crim.ind N
0 253
1 253

Create train and test sets

set.seed(1995)
train_ind <- sample(nrow(dat), nrow(dat) * 0.7)
train <- as.data.table(dat[train_ind, ])
test <- as.data.table(dat[-train_ind, ])

train[, .N, crim.ind] %>% kable(caption = "Crime rate comparison to median (train set)") %>% 
    kable_styling(position = "left", bootstrap_options = "hover")
Crime rate comparison to median (train set)
crim.ind N
1 184
0 170
test[, .N, crim.ind] %>% kable(caption = "Crime rate comparison to median (test set)") %>% 
    kable_styling(position = "left", bootstrap_options = "hover")
Crime rate comparison to median (test set)
crim.ind N
0 83
1 69

Logistic regression

All variables

fit.glm <- glm(crim.ind ~ ., data = train, family = binomial)
test$predict <- predict(fit.glm, test, type = "response")
test[, `:=`(crim.p, fifelse(predict < 0.5, 0, 1))]

table(test$crim.ind, test$crim.p) %>% # xtable() %>%
kable(caption = "Actual vs predicted") %>% kable_styling(position = "left", bootstrap_options = "hover") %>% 
    column_spec(1, bold = T)
Actual vs predicted
            Predicted 0   Predicted 1
Actual 0             75             8
Actual 1              7            62
acc <- round(mean(test$crim.p == test$crim.ind), 3) * 100
err <- round(mean(test$crim.p != test$crim.ind), 3) * 100
cat("Test accuracy:", acc, "%; \n", "Test error rate:", err, "% \n")

Test accuracy: 90.1 %; Test error rate: 9.9 %
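Accuracy alone hides the two error types. A small helper (a sketch; the counts are copied from the confusion table above) also reports sensitivity and specificity:

```r
# Accuracy, sensitivity and specificity from a 2x2 confusion matrix
# (rows = actual 0/1, columns = predicted 0/1)
conf_stats <- function(tab) {
  c(accuracy    = (tab[1, 1] + tab[2, 2]) / sum(tab),
    sensitivity = tab[2, 2] / sum(tab[2, ]),   # true-positive rate
    specificity = tab[1, 1] / sum(tab[1, ]))   # true-negative rate
}
cm <- matrix(c(75, 7, 8, 62), nrow = 2)  # counts from the logistic fit above
round(conf_stats(cm), 3)
#    accuracy sensitivity specificity
#       0.901       0.899       0.904
```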

fit.glm %>% summary() %>% pander()
  Estimate Std. Error z value Pr(>|z|)
(Intercept) -32.39 10.52 -3.08 0.002071
zn -0.1067 0.04466 -2.389 0.01687
indus -0.08002 0.06342 -1.262 0.207
chas 0.8943 0.9275 0.9642 0.3349
nox 54.91 10.2 5.385 7.251e-08
rm -0.1994 0.9313 -0.2141 0.8305
age 0.03965 0.01722 2.302 0.02132
dis 0.958 0.3168 3.024 0.002497
rad 0.7909 0.2016 3.922 8.77e-05
tax -0.007253 0.003429 -2.115 0.03439
ptratio 0.4935 0.1836 2.687 0.007209
black -0.04434 0.01821 -2.434 0.01492
lstat 0.03954 0.0695 0.5689 0.5694
medv 0.2064 0.09856 2.094 0.03625

(Dispersion parameter for binomial family taken to be 1 )

Null deviance: 490.2 on 353 degrees of freedom
Residual deviance: 128.0 on 340 degrees of freedom

Excluding non significant variables

fit.glm2 <- glm(crim.ind ~ . - indus - chas - rm - lstat, data = train, family = binomial)
test$predict <- predict(fit.glm2, test, type = "response")
test[, `:=`(crim.p, fifelse(predict < 0.5, 0, 1))]

table(test$crim.ind, test$crim.p) %>% # xtable() %>%
kable(caption = "Actual vs predicted") %>% kable_styling(position = "left", bootstrap_options = "hover") %>% 
    column_spec(1, bold = T)
Actual vs predicted
            Predicted 0   Predicted 1
Actual 0             72            11
Actual 1              7            62
acc <- round(mean(test$crim.p == test$crim.ind), 3) * 100
err <- round(mean(test$crim.p != test$crim.ind), 3) * 100
cat("Test accuracy:", acc, "%; \n", "Test error rate:", err, "% \n")

Test accuracy: 88.2 %; Test error rate: 11.8 %

fit.glm2 %>% summary() %>% pander()
  Estimate Std. Error z value Pr(>|z|)
(Intercept) -30.07 9.982 -3.013 0.002588
zn -0.111 0.04267 -2.601 0.009286
nox 49.15 8.923 5.508 3.631e-08
age 0.0401 0.01411 2.842 0.004478
dis 0.8805 0.2954 2.981 0.002872
rad 0.9039 0.1925 4.695 2.67e-06
tax -0.00912 0.003004 -3.036 0.002398
ptratio 0.4294 0.1558 2.757 0.005834
black -0.04029 0.01644 -2.45 0.01427
medv 0.1729 0.05443 3.177 0.001488

(Dispersion parameter for binomial family taken to be 1 )

Null deviance: 490.2 on 353 degrees of freedom
Residual deviance: 130.7 on 344 degrees of freedom

LDA

All variables

fit.lda <- lda(crim.ind ~ ., data = train)
test$crim.p <- predict(fit.lda, test, type = "response")$class

table(test$crim.ind, test$crim.p) %>% # xtable() %>%
kable(caption = "Actual vs predicted") %>% kable_styling(position = "left", bootstrap_options = "hover") %>% 
    column_spec(1, bold = T)
Actual vs predicted
            Predicted 0   Predicted 1
Actual 0             77             6
Actual 1             11            58
acc <- round(mean(test$crim.p == test$crim.ind), 3) * 100
err <- round(mean(test$crim.p != test$crim.ind), 3) * 100
cat("Test accuracy:", acc, "%; \n", "Test error rate:", err, "% \n")

Test accuracy: 88.8 %; Test error rate: 11.2 %

fit.lda %>% summary() %>% pander()
  Length Class Mode
prior 2 -none- numeric
counts 2 -none- numeric
means 26 -none- numeric
scaling 13 -none- numeric
lev 2 -none- character
svd 1 -none- numeric
N 1 -none- numeric
call 3 -none- call
terms 3 terms call
xlevels 0 -none- list

Excluding some variables

fit.lda2 <- lda(crim.ind ~ . - indus - chas - rm - lstat, data = train)
test$crim.p <- predict(fit.lda2, test, type = "response")$class

table(test$crim.ind, test$crim.p) %>% # xtable() %>%
kable(caption = "Actual vs predicted") %>% kable_styling(position = "left", bootstrap_options = "hover") %>% 
    column_spec(1, bold = T)
Actual vs predicted
            Predicted 0   Predicted 1
Actual 0             77             6
Actual 1             10            59
acc <- round(mean(test$crim.p == test$crim.ind), 3) * 100
err <- round(mean(test$crim.p != test$crim.ind), 3) * 100
cat("Test accuracy:", acc, "%; \n", "Test error rate:", err, "% \n")

Test accuracy: 89.5 %; Test error rate: 10.5 %

fit.lda2 %>% summary() %>% pander()
  Length Class Mode
prior 2 -none- numeric
counts 2 -none- numeric
means 18 -none- numeric
scaling 9 -none- numeric
lev 2 -none- character
svd 1 -none- numeric
N 1 -none- numeric
call 3 -none- call
terms 3 terms call
xlevels 0 -none- list

KNN

# select features
nm <- which(!names(dat) %in% c("indus", "chas", "rm", "lstat"))
dat.knn <- copy(dat[, nm, with = FALSE])

# transform variables
nm <- names(dat.knn)
nm <- nm[!grepl("crim.ind", nm)]
dat.knn[, `:=`((nm), lapply(.SD, scale)), .SDcols = nm]

# split train - test sets
train.knn <- dat.knn[train_ind, ]
test.knn <- dat.knn[-train_ind, ]

K=1

# KNN with K=1
crim.p <- knn(train.knn[, !"crim.ind"], test.knn[, !"crim.ind"], train.knn[, crim.ind], 
    k = 1)
table(test.knn[, crim.ind], crim.p) %>% # xtable() %>%
kable(caption = "Actual vs predicted") %>% kable_styling(position = "left", bootstrap_options = "hover") %>% 
    column_spec(1, bold = T)
Actual vs predicted
            Predicted 0   Predicted 1
Actual 0             75             8
Actual 1              4            65
acc <- round(mean(crim.p == test.knn$crim.ind), 3) * 100
err <- round(mean(crim.p != test.knn$crim.ind), 3) * 100
cat("Test accuracy:", acc, "%; \n", "Test error rate:", err, "% \n")

Test accuracy: 92.1 %; Test error rate: 7.9 %

K=3

# KNN with K=3
crim.p3 <- knn(train.knn[, !"crim.ind"], test.knn[, !"crim.ind"], train.knn[, crim.ind], 
    k = 3)
table(test.knn[, crim.ind], crim.p3) %>% # xtable() %>%
kable(caption = "Actual vs predicted") %>% kable_styling(position = "left", bootstrap_options = "hover") %>% 
    column_spec(1, bold = T)
Actual vs predicted
            Predicted 0   Predicted 1
Actual 0             78             5
Actual 1              5            64
acc <- round(mean(crim.p3 == test.knn$crim.ind), 3) * 100
err <- round(mean(crim.p3 != test.knn$crim.ind), 3) * 100
cat("Test accuracy:", acc, "%; \n", "Test error rate:", err, "% \n")

Test accuracy: 93.4 %; Test error rate: 6.6 %

K=5

# KNN with K=5
crim.p5 <- knn(train.knn[, !"crim.ind"], test.knn[, !"crim.ind"], train.knn[, crim.ind], 
    k = 5)
table(test.knn[, crim.ind], crim.p5) %>% # xtable() %>%
kable(caption = "Actual vs predicted") %>% kable_styling(position = "left", bootstrap_options = "hover") %>% 
    column_spec(1, bold = T)
Actual vs predicted
            Predicted 0   Predicted 1
Actual 0             81             2
Actual 1              4            65
acc <- round(mean(crim.p5 == test.knn$crim.ind), 3) * 100
err <- round(mean(crim.p5 != test.knn$crim.ind), 3) * 100
cat("Test accuracy:", acc, "%; \n", "Test error rate:", err, "% \n")

Test accuracy: 96.1 %; Test error rate: 3.9 %

KNN performed best: with K = 5 it reached a test error rate of 3.9 %, compared with 9.9 % for the best logistic regression and 10.5 % for the best LDA model.
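Rather than trying K = 1, 3, 5 by hand, the grid search can be automated. A self-contained base-R sketch (assuming the MASS and class packages are installed; it repeats the same seed and preprocessing as above, so the K = 1, 3, 5 entries should match the results reported there):

```r
library(MASS)    # Boston data
library(class)   # knn()

dat <- Boston
dat$crim.ind <- ifelse(dat$crim < median(dat$crim), 0, 1)
dat$crim <- NULL

# same feature subset and scaling as above
keep <- setdiff(names(dat), c("indus", "chas", "rm", "lstat", "crim.ind"))
X <- scale(dat[, keep])

set.seed(1995)
train_ind <- sample(nrow(dat), floor(0.7 * nrow(dat)))

# test error rate over a grid of K
ks  <- c(1, 3, 5, 7, 9, 11)
err <- sapply(ks, function(k) {
  pred <- knn(X[train_ind, ], X[-train_ind, ], dat$crim.ind[train_ind], k = k)
  mean(pred != dat$crim.ind[-train_ind])
})
setNames(round(err, 3), ks)
```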

I9.5

I9.5 We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features.
(a) Generate a data set with n = 500 and p = 2, such that the observations belong to two classes with a quadratic decision boundary between them. For instance, you can do this as follows:
> x1 = runif(500) - 0.5
> x2 = runif(500) - 0.5
> y = 1 * (x1^2 - x2^2 > 0)
(b) Plot the observations, colored according to their class labels. Your plot should display X1 on the x-axis, and X2 on the y-axis.
(c) Fit a logistic regression model to the data, using X1 and X2 as predictors.
(d) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be linear.
(e) Now fit a logistic regression model to the data using non-linear functions of \(X_1\) and \(X_2\) as predictors (e.g. \(X^2_1\) , \(X_1 \times X_2\), \(log(X_2)\), and so forth).
(f) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)-(e) until you come up with an example in which the predicted class labels are obviously non-linear.
(g) Fit a support vector classifier to the data with \(X_1\) and \(X_2\) as predictors. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
(h) Fit a SVM using a non-linear kernel to the data. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
(i) Comment on your results.

(a)

set.seed(1995)
x1 <- runif(500) - 0.5
x2 <- runif(500) - 0.5
y <- 1 * (x1^2 - x2^2 > 0)
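Note what boundary this generates: x1^2 - x2^2 > 0 is equivalent to |x1| > |x2|, so the two classes are separated by the lines x2 = x1 and x2 = -x1. A quick base-R check:

```r
set.seed(1995)
x1 <- runif(500) - 0.5
x2 <- runif(500) - 0.5
y  <- 1 * (x1^2 - x2^2 > 0)
# x1^2 - x2^2 > 0  <=>  |x1| > |x2|
all(y == as.integer(abs(x1) > abs(x2)))  # TRUE
```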

(b)

dat9 <- data.table(y = as.factor(y), x1 = x1, x2 = x2)


fig <- plot_ly(dat9, x = ~x1, y = ~x2, color = ~y, symbol = ~y, type = "scatter", 
    mode = "markers", hoverinfo = "text", marker = list(size = 7), text = ~paste("</br> Y class: ", 
        y, "</br> X1: ", round(x1, 3), "</br> X2: ", round(x2, 3)))

fig <- fig %>% layout(title = "X1 vs X2", xaxis = list(title = "X1"), yaxis = list(title = "X2"))
fig

(c)

fit.glm9 <- glm(y ~ x1 + x2, data = dat9, family = binomial)
fit.glm9 %>% summary() %>% pander()
  Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.06214 0.08976 0.6922 0.4888
x1 -0.05384 0.3139 -0.1715 0.8638
x2 -0.3561 0.3165 -1.125 0.2604

(Dispersion parameter for binomial family taken to be 1 )

Null deviance: 692.8 on 499 degrees of freedom
Residual deviance: 691.5 on 497 degrees of freedom

(d)

dat9$predict <- predict(fit.glm9, dat9, type = "response")
dat9[, `:=`(y.p, fifelse(predict < 0.5, 0, 1))]

fig <- plot_ly(dat9, x = ~x1, y = ~x2, color = ~y.p, symbol = ~y.p, type = "scatter", 
    mode = "markers", hoverinfo = "text", marker = list(size = 7), text = ~paste("</br> Real: ", 
        y, "</br> Predicted: ", y.p, "</br> X1: ", round(x1, 3), "</br> X2: ", round(x2, 
            3)))

fig <- fig %>% layout(title = "Logistic regression results", xaxis = list(title = "X1"), 
    yaxis = list(title = "X2"))
fig

(e)

fit.glm92 <- glm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2), data = dat9, family = binomial)
fit.glm92 %>% summary() %>% pander()
  Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.727 478 -0.01198 0.9904
x1 48.12 7921 0.006075 0.9952
x2 37.57 9631 0.003901 0.9969
I(x1^2) 25208 597146 0.04221 0.9663
I(x2^2) -24493 579277 -0.04228 0.9663
I(x1 * x2) 229 30926 0.007405 0.9941

(Dispersion parameter for binomial family taken to be 1 )

Null deviance: 6.928e+02 on 499 degrees of freedom
Residual deviance: 6.313e-06 on 494 degrees of freedom
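The near-zero residual deviance and the huge, insignificant standard errors are a symptom of complete separation: the quadratic model can represent the true boundary x1^2 = x2^2 exactly, so maximum likelihood pushes every fitted probability to 0 or 1 and the coefficients diverge. A sketch that regenerates the data and checks this (glm() will warn that fitted probabilities numerically 0 or 1 occurred):

```r
set.seed(1995)
x1 <- runif(500) - 0.5
x2 <- runif(500) - 0.5
y  <- 1 * (x1^2 - x2^2 > 0)

fit <- glm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2), family = binomial)
# under complete separation every fitted probability is numerically 0 or 1
p <- fitted(fit)
mean(pmin(p, 1 - p) < 1e-6)  # close to 1
```

The separation does not invalidate the predicted labels used in (f); it only makes the coefficient table above meaningless for inference.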

(f)

dat9$predict2 <- predict(fit.glm92, dat9, type = "response")
dat9[, `:=`(y.p2, fifelse(predict2 < 0.5, 0, 1))]

fig <- plot_ly(dat9, x = ~x1, y = ~x2, color = ~y.p2, symbol = ~y.p2, type = "scatter", 
    mode = "markers", hoverinfo = "text", marker = list(size = 7), text = ~paste("</br> Real: ", 
        y, "</br> Predicted: ", y.p2, "</br> X1: ", round(x1, 3), "</br> X2: ", round(x2, 
            3)))

fig <- fig %>% layout(title = "Logistic regression results", xaxis = list(title = "X1"), 
    yaxis = list(title = "X2"))
fig

(g)

fit.svm <- svm(as.factor(y) ~ x1 + x2, dat9, kernel = "linear", cost = 0.01)
dat9$y.p3 <- predict(fit.svm, dat9, type = "response")

fig <- plot_ly(dat9, x = ~x1, y = ~x2, color = ~y.p3, symbol = ~y.p3, type = "scatter", 
    mode = "markers", hoverinfo = "text", marker = list(size = 7), text = ~paste("</br> Real: ", 
        y, "</br> Predicted: ", y.p3, "</br> X1: ", round(x1, 3), "</br> X2: ", round(x2, 
            3)))

fig <- fig %>% layout(title = "SVM results", xaxis = list(title = "X1"), yaxis = list(title = "X2"))
fig

(h)

fit.svm2 <- svm(as.factor(y) ~ x1 + x2, dat9, kernel = "radial", gamma = 1)
dat9$y.p4 <- predict(fit.svm2, dat9, type = "response")

fig <- plot_ly(dat9, x = ~x1, y = ~x2, color = ~y.p4, symbol = ~y.p4, type = "scatter", 
    mode = "markers", hoverinfo = "text", marker = list(size = 7), text = ~paste("</br> Real: ", 
        y, "</br> Predicted: ", y.p4, "</br> X1: ", round(x1, 3), "</br> X2: ", round(x2, 
            3)))

fig <- fig %>% layout(title = "SVM results", xaxis = list(title = "X1"), yaxis = list(title = "X2"))
fig

(i)

  • With only linear terms, both the logistic regression and the support vector classifier produce linear decision boundaries and split the two classes poorly. Adding quadratic terms to the logistic regression, or using a radial kernel in the SVM, recovers the true quadratic boundary between the classes. The kernel SVM does so without the feature transformations having to be specified by hand, which is its main practical advantage here.

Bonus Tasks

1. Text document analysis

  1. Select any document you have written, preferably in English (Lithuanian is possible in UTF-8 encoding), or an article in Econometrics. Write R code that selects the most frequent words in the paper and also in each subsection. Draw a visual representation of the words. Can the topic of a subsection be guessed from the visual representation of its most frequent words? Are the topics related by some keywords?

Words frequency

dat.pdf <- Corpus(URISource("straipsnis.pdf"), readerControl = list(reader = readPDF))

dtm <- TermDocumentMatrix(dat.pdf, control = list(removePunctuation = TRUE, stopwords = TRUE, 
    tolower = TRUE, stemming = TRUE, removeNumbers = TRUE, bounds = list(global = c(1, 
        Inf))))

mm <- as.matrix(dtm)
ww <- sort(rowSums(mm), decreasing = TRUE)
dd <- data.frame(word = names(ww), freq = ww)

head(dd, 10) %>% kable(caption = "10 most popular words") %>% kable_styling(position = "left", 
    bootstrap_options = "hover")
10 most popular words
word       freq
path         32
weight       30
rout         29
algorithm    24
shortest     24
time         21
road         20
electr       18
energi       17
vehicl       17

Words frequency plot

set.seed(1995)
wordcloud(words = dd$word, freq = dd$freq, min.freq = 5, max.words = 200, random.order = FALSE, 
    rot.per = 0.35, colors = brewer.pal(8, "Dark2"))

The paper is about an application of Dijkstra's (weighted shortest-path) algorithm to electric-vehicle route planning. The word-frequency plot represents the document well: the most frequent stems (path, weight, rout, algorithm, shortest) point directly at the topic.

Words frequency plot by section

dat.pdf.c <- VCorpus(VectorSource(pdf_text("straipsnis.pdf")))

dtm.c <- TermDocumentMatrix(dat.pdf.c, control = list(removePunctuation = TRUE, stopwords = TRUE, 
    tolower = TRUE, stemming = TRUE, removeNumbers = TRUE, bounds = list(global = c(1, 
        Inf))))

mm2 <- as.matrix(dtm.c)

for (i in 1:(ncol(mm2))) {
    stat <- mm2[, i]
    v <- sort(stat, decreasing = TRUE)
    d <- data.frame(word = names(v), freq = v)
    
    wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words = 20, random.order = FALSE, 
        rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
}

The visual representation of the individual subsections is not very informative, but the general idea of each subsection can still be recognized.

2. Barro-Lee example

  1. Using library(hdm) in R, work with the Barro-Lee example on the convergence dataset, where you randomly drop a subset of 20 countries. For the random generator use set.seed(yyyymmdd), where the letters are your birthday. Compute the post square-root LASSO; compute the full OLS estimate; make a double selection with the partialling-out and the double-selection routines. Compare the results for the inference on the treatment variable (the third variable in the dataset), which reflects the conditional convergence hypothesis. Apply the same estimations to the full dataset. Are the results robust to the exclusion of 20 countries? For additional details see p. 12 of vignette('hdm_vignette').

Dropping 20 countries

set.seed(19950330)

dat.b <- GrowthData
setDT(dat.b)
drop.ind <- sample(1:(nrow(dat.b)), 20)
dat.b <- dat.b[!drop.ind, ]

dim.b <- dim(dat.b)
cat("Data dimensions after dropping 20 countries: rows:", dim.b[1], "; columns:", 
    dim.b[2], ". \n")

Data dimensions after dropping 20 countries: rows: 70 ; columns: 63 .

# from vignette
y <- dat.b[, 1, drop = F]
d <- dat.b[, 3, drop = F]
X <- as.matrix(dat.b)[, -c(1, 2, 3)]
varnames <- colnames(dat.b)

# OLS:
xnames <- varnames[-c(1, 2, 3)]
dandxnames <- varnames[-c(1, 2)]

# create formulas by pasting names (this saves typing times)
fmla <- as.formula(paste("Outcome ~ ", paste(dandxnames, collapse = "+")))
ls.effect <- lm(fmla, data = dat.b)

# effect by the partialling out by Post-Lasso:
lasso.effect <- rlassoEffect(x = X, y = y, d = d, method = "partialling out")
summary(lasso.effect)

[1] "Estimates and significance testing of the effect of target variables"
     Estimate. Std. Error t value Pr(>|t|)
[1,]  -0.03932    0.02066  -1.903    0.057 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# double selection:
doublesel.effect <- rlassoEffect(x = X, y = y, d = d, method = "double selection")
summary(doublesel.effect)

[1] "Estimates and significance testing of the effect of target variables"
         Estimate. Std. Error t value Pr(>|t|)
gdpsh465  -0.03932    0.02097  -1.875   0.0608 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

table <- rbind(summary(ls.effect)$coef["gdpsh465", 1:2], summary(lasso.effect)$coef[, 
    1:2], summary(doublesel.effect)$coef[, 1:2])
colnames(table) <- c("Estimate", "Std. Error")
rownames(table) <- c("OLS", "Post-Lasso ", "Double selection")

table %>% # xtable(digits=c(2, 2,5))%>% xtable() %>%
kable(caption = "Models comparison") %>% kable_styling(position = "left", bootstrap_options = "hover") %>% 
    column_spec(1, bold = T)
Models comparison
Estimate Std. Error
OLS -0.0287775 0.0477123
Post-Lasso -0.0393177 0.0206559
Double selection -0.0393177 0.0209711

Full dataset

dat.b.f <- GrowthData
setDT(dat.b.f)

dim.b.f <- dim(dat.b.f)
cat("Full dataset dimensions: rows:", dim.b.f[1], "; columns:", dim.b.f[2], ". \n")

Full dataset dimensions: rows: 90 ; columns: 63 .

# from vignette
y <- dat.b.f[, 1, drop = F]
d <- dat.b.f[, 3, drop = F]
X <- as.matrix(dat.b.f)[, -c(1, 2, 3)]
varnames <- colnames(dat.b.f)

# OLS:
xnames <- varnames[-c(1, 2, 3)]
dandxnames <- varnames[-c(1, 2)]

# create formulas by pasting names (this saves typing times)
fmla <- as.formula(paste("Outcome ~ ", paste(dandxnames, collapse = "+")))
ls.effect <- lm(fmla, data = dat.b.f)

# effect by the partialling out by Post-Lasso:
lasso.effect <- rlassoEffect(x = X, y = y, d = d, method = "partialling out")
summary(lasso.effect)

[1] "Estimates and significance testing of the effect of target variables"
     Estimate. Std. Error t value Pr(>|t|)
[1,] -0.04981    0.01394  -3.574 0.000351 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# double selection:
doublesel.effect <- rlassoEffect(x = X, y = y, d = d, method = "double selection")
summary(doublesel.effect)

[1] "Estimates and significance testing of the effect of target variables"
         Estimate. Std. Error t value Pr(>|t|)
gdpsh465  -0.05001    0.01579  -3.167  0.00154 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

table <- rbind(summary(ls.effect)$coef["gdpsh465", 1:2], summary(lasso.effect)$coef[, 
    1:2], summary(doublesel.effect)$coef[, 1:2])
colnames(table) <- c("Estimate", "Std. Error")
rownames(table) <- c("OLS", "Post-Lasso ", "Double selection")

table %>% kable(caption = "Models comparison") %>% kable_styling(position = "left", 
    bootstrap_options = "hover") %>% column_spec(1, bold = T)
Models comparison
Estimate Std. Error
OLS -0.0093780 0.0298877
Post-Lasso -0.0498115 0.0139364
Double selection -0.0500059 0.0157914
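On the final question (robustness), a side-by-side view of the two runs makes the comparison explicit; the values below are copied from the two comparison tables above (base-R sketch):

```r
# Estimates and standard errors copied from the two comparison tables above
est <- cbind(
  dropped = c(OLS = -0.0288, `Post-Lasso` = -0.0393, `Double selection` = -0.0393),
  full    = c(-0.0094, -0.0498, -0.0500)
)
se <- cbind(
  dropped = c(0.0477, 0.0207, 0.0210),
  full    = c(0.0299, 0.0139, 0.0158)
)
round(est / se, 2)  # approximate t statistics in each sample
```

The Lasso-based estimates keep their sign and rough magnitude (about -0.04 versus -0.05), so the negative conditional-convergence effect is qualitatively robust to dropping 20 countries; quantitatively, with only 70 observations the effect is borderline significant (t ≈ -1.9), while in the full sample it is clearly significant (t ≈ -3.2 to -3.6). The unregularized OLS estimate is unstable and insignificant in both samples.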